
Latent Semantic Analysis


Assessing the Applicability of Natural Language Processing to Traditional Social Science Methodology: A Case Study in Identifying Strategic Signaling Patterns in Presidential Directives

LeMay, C., Lane, A., Seales, J., Winstead, M., Baty, S.

arXiv.org Artificial Intelligence

Our research investigates how Natural Language Processing (NLP) can be used to extract main topics from a larger corpus of written data, as applied to the case of identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. Analysts and NLP both identified relevant documents, demonstrating the potential utility of NLP in research involving large written corpora. However, we also identified discrepancies between NLP and human-labeled results that indicate a need for more research to assess the validity of NLP in this use case. The research was conducted in 2023, and the rapidly evolving landscape of AI/ML means existing tools have improved and new tools have been developed; this research displays the inherent capabilities of a potentially dated AI tool in emerging social science applications.


Exploring Aviation Incident Narratives Using Topic Modeling and Clustering Techniques

Nanyonga, Aziida, Wasswa, Hassan, Turhan, Ugur, Joiner, Keith, Wild, Graham

arXiv.org Artificial Intelligence

Aviation safety is a global concern, requiring detailed investigations into incidents to understand contributing factors comprehensively. This study uses the National Transportation Safety Board (NTSB) dataset. It applies advanced natural language processing (NLP) techniques, including Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), and K-means clustering. The main objectives are identifying latent themes, exploring semantic relationships, assessing probabilistic connections, and clustering incidents based on shared characteristics. This research contributes to aviation safety by providing insights into incident narratives and demonstrating the versatility of NLP and topic modelling techniques in extracting valuable information from complex datasets. The results, including topics identified from various techniques, provide an understanding of recurring themes. Comparative analysis reveals that LDA performed best with a coherence value of 0.597, followed by pLSA at 0.583, LSA at 0.542, and NMF at 0.437. K-means clustering further reveals commonalities and unique insights into incident narratives. In conclusion, this study uncovers latent patterns and thematic structures within incident narratives, offering a comparative analysis of multiple topic modelling techniques. Future research avenues include exploring temporal patterns, incorporating additional datasets, and developing predictive models for early identification of safety issues. This research lays the groundwork for enhancing the understanding and improvement of aviation safety by utilising the wealth of information embedded in incident narratives.


Comparative Analysis of Topic Modeling Techniques on ATSB Text Narratives Using Natural Language Processing

Nanyonga, Aziida, Wasswa, Hassan, Turhan, Ugur, Joiner, Keith, Wild, Graham

arXiv.org Artificial Intelligence

Improvements in aviation safety analysis call for innovative techniques to extract valuable insights from the abundance of textual data available in accident reports. This paper explores the application of four prominent topic modelling techniques, namely Probabilistic Latent Semantic Analysis (pLSA), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF), to dissect aviation incident narratives using the Australian Transport Safety Bureau (ATSB) dataset. The study examines each technique's ability to unveil latent thematic structures within the data, providing safety professionals with a systematic approach to gain actionable insights. Through a comparative analysis, this research not only showcases the potential of these methods in aviation safety but also elucidates their distinct advantages and limitations.


Evaluating Text Summaries Generated by Large Language Models Using OpenAI's GPT

Shakil, Hassan, Mahi, Atqiya Munawara, Nguyen, Phuoc, Ortiz, Zeydy, Mardini, Mamoun T.

arXiv.org Artificial Intelligence

In the contemporary era characterized by a deluge of data, the intelligence community faces the challenge of information overload, needing to process vast amounts of information swiftly and effectively. The ability to generate succinct, clear, and actionable summaries from diverse data sources is crucial, as it often determines the success of strategic objectives in this information-rich environment. As the demand for systems capable of automating large-scale text summarization without compromising on quality or relevance intensifies, the role of such technologies becomes increasingly critical [Liu and Lapata, 2019]. Text summarization, a pivotal task within Natural Language Processing (NLP), has found widespread application across various domains, including news aggregation and the distillation of extensive documents into manageable summaries. The exponential growth in data underscores the utility of text summarization in enhancing content accessibility and comprehension, thus facilitating more efficient navigation through information landscapes [Chouikhi and Alsuhaibani, 2022].


The evolving of Data Science and the Saudi Arabia case. How much have we changed in 13 years?

Barahona, Igor

arXiv.org Machine Learning

This work conducts a comprehensive examination of data science vocabulary usage over the past 13 years. The investigation commences with a dataset comprising 16,018 abstracts that feature the term "data science" in the title, abstract, or keywords. The study identifies documents that introduce novel vocabulary and subsequently explores how this vocabulary has been incorporated into the scientific literature. To achieve these objectives, I employ techniques such as Exploratory Data Analysis, Latent Semantic Analysis, Latent Dirichlet Allocation, and N-grams Analysis. A comparison of scientific publications between overall results and those specific to Saudi Arabia is presented. Based on how the vocabulary is utilized, representative articles are identified.


Automated Code Extraction from Discussion Board Text Dataset

Saravani, Sina Mahdipour, Ghaffari, Sadaf, Luther, Yanye, Folkestad, James, Moraes, Marcia

arXiv.org Artificial Intelligence

This study introduces and investigates the capabilities of three different text mining approaches, namely Latent Semantic Analysis, Latent Dirichlet Allocation, and Clustering Word Vectors, for automating code extraction from a relatively small discussion board dataset. We compare the outputs of each algorithm with a previous dataset that was manually coded by two human raters. The results show that even with a relatively small dataset, automated approaches can be an asset to course instructors by extracting some of the discussion codes, which can be used in Epistemic Network Analysis.


Unsupervised Broadcast News Summarization: A Comparative Study on Maximal Marginal Relevance (MMR) and Latent Semantic Analysis (LSA)

Ramezani, Majid, Shahryari, Mohammad-Salar, Feizi-Derakhshi, Amir-Reza, Feizi-Derakhshi, Mohammad-Reza

arXiv.org Artificial Intelligence

The methods of automatic speech summarization are classified into two groups: supervised and unsupervised. Supervised methods are based on a set of features, while unsupervised methods perform summarization based on a set of rules. Latent Semantic Analysis (LSA) and Maximal Marginal Relevance (MMR) are considered the most important and well-known unsupervised methods in automatic speech summarization. This study set out to investigate the performance of the two aforementioned unsupervised methods on summarization of Persian broadcast news transcriptions. The results show that in generic summarization LSA outperforms MMR, while in query-based summarization MMR outperforms LSA.
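As a rough illustration of the MMR criterion the paper evaluates, the following sketch greedily selects sentences that are relevant to a query while penalizing redundancy with sentences already chosen. The corpus, the TF-IDF representation, and the trade-off parameter lam are all illustrative assumptions, not the authors' Persian-news pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_summarize(sentences, query, k=2, lam=0.7):
    # Represent sentences and query in the same TF-IDF space
    vec = TfidfVectorizer()
    S = vec.fit_transform(sentences + [query])
    sims_query = cosine_similarity(S[:-1], S[-1]).ravel()  # relevance to query
    sims_pair = cosine_similarity(S[:-1])                  # sentence-sentence similarity

    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            # MMR: lam * relevance - (1 - lam) * redundancy
            redundancy = max((sims_pair[i][j] for j in selected), default=0.0)
            return lam * sims_query[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With lam close to 1 the selection is driven purely by query relevance; lowering it increasingly favors diversity among the selected sentences.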


Uncovering Hidden Meaning: A Beginner's Guide to Latent Semantic Analysis

#artificialintelligence

If you have ever worked with text data, you have likely encountered the challenge of dealing with high-dimensional and sparse data. One popular solution to this problem is latent semantic analysis (LSA), also known as latent semantic indexing (LSI). LSA is a technique for extracting latent (hidden) semantics from a collection of documents or text data. It does this by mapping the documents into a lower-dimensional space, where the relationships between the documents and the underlying concepts they represent can be more easily understood. One of the key benefits of LSA is that it can handle large amounts of data efficiently and is robust to noise and sparse data.
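The mapping described above is conventionally computed with a truncated singular value decomposition of the document-term matrix. A minimal sketch using scikit-learn's TruncatedSVD (a standard choice for sparse matrices) is below; the corpus and the number of components are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cats and dogs are popular pets",
    "dogs chase cats in the yard",
    "stocks and bonds move the market",
    "the stock market fell on bond news",
]

# High-dimensional, sparse document-term matrix
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# LSA: project documents into a low-dimensional latent space via truncated SVD
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

print(Z.shape)  # (4, 2): each document is now a 2-dimensional vector
```

Documents about the same latent concept (pets vs. finance here) end up close together in the reduced space, even when they share few exact terms.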


A comparison of latent semantic analysis and correspondence analysis of document-term matrices

Qi, Qianqian, Hessen, David J., Deoskar, Tejaswini, van der Heijden, Peter G. M.

arXiv.org Artificial Intelligence

Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, i.e. sums of row elements and column elements, arising from differing document-lengths and term-frequencies are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA based methods on text categorization in English and authorship attribution on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, amongst several contenders.
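The margin-elimination property described above comes from CA's construction: it takes the SVD of standardized residuals rather than of the raw document-term matrix. The numpy sketch below follows the standard textbook construction on a toy count matrix; it is illustrative code, not the authors' implementation.

```python
import numpy as np

def ca(N, k=2):
    # Correspondence analysis of a count matrix N (documents x terms)
    P = N / N.sum()                  # correspondence matrix
    r = P.sum(axis=1)                # row margins (document lengths)
    c = P.sum(axis=0)                # column margins (term frequencies)
    # Standardized residuals: margin effects are divided out here
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the rows (documents)
    rows = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]
    return rows, sv[:k]

N = np.array([[5., 1., 0.],
              [4., 2., 1.],
              [0., 1., 6.]])
coords, sv = ca(N)
```

Where plain LSA applies the SVD directly to (weighted) counts, CA first removes the expected values under independence, which is why differing document lengths and term frequencies do not dominate the solution.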


How to use Latent Semantic Analysis to classify documents

#artificialintelligence

The children were sitting in a circle on the floor. "The flat hat has a number and a label that says parrots and battercakes," one of the kids screamed. Every single child started laughing. "Nooooo, it was 'the black cat is under the table and it eats carrots and pancakes,'" another child replied. I realized only then that they were playing telephone (or broken telephone, as we call it in Argentina). Human communication is complex, mainly because each person expresses themselves differently. We could speak the same language but use different slang, words, or expressions to convey the same message.